
Conversation


@jgehrcke jgehrcke commented Nov 20, 2025

+misc changes.

See commit messages.

Tests:

test_basics.bats
 ✓ test VERSION_W_COMMIT, VERSION_GHCR_CHART, VERSION [194]
 ✓ confirm no kubelet plugin pods running [169]
 ✓ helm-install oci://ghcr.io/nvidia/k8s-dra-driver-gpu/25.12.0-dev-f8fceeae-chart [9525]
 ✓ helm list: validate output [230]
 ✓ get crd computedomains.resource.nvidia.com [164]
 ✓ wait for plugin & controller pods READY [749]
 ✓ validate CD controller container image spec [169]
test_gpu_basic.bats
 ✓ 1 pod(s), 1 full GPU [5089]
 ✓ 2 pod(s), 1 full GPU each [5065]
 ✓ 2 pod(s), 1 full GPU (shared, 1 RC) [5098]
 ✓ 1 pod(s), 2 cntrs, 1 full GPU (shared, 1 RCT) [4382]
test_cd_imex_chan_inject.bats
 ✓ IMEX channel injection (single) [14871]
 ✓ IMEX channel injection (all) [12443]
test_cd_mnnvl_workload.bats
 ✓ nickelpie (NCCL send/recv/broadcast, 2 pods, 2 nodes, small payload) [11296]
 ✓ nvbandwidth (2 nodes, 2 GPUs each) [16139]
test_cd_misc.bats
 ✓ CD daemon shutdown: confirm CD status cleanup [9212]
 ✓ reject unknown field in opaque cfg in CD chan ResourceClaim [10262]
 ✓ self-initiated unprepare of stale RCs in PrepareStarted [25815]
test_cd_logging.bats
 ✓ CD controller/plugin: startup config / detail in logs on level 0 [6462]
 ✓ CD controller: test log verbosity levels [57130]
 ✓ CD daemon: test log verbosity levels [32774]
test_cd_failover.bats
 ✓ CD failover nvb2: force-delete worker pod 0 [48916]
 ✓ CD failover nvb2: force-delete all IMEX daemons [36919]
 ✓ CD failover nvb2: regular-delete worker pod 1 [54482]
test_cd_updowngrade.bats
 ✓ downgrade: current-dev -> last-stable [25450]
 ✓ upgrade: wipe-state, install-last-stable, upgrade-to-current-dev [34085]
test_gpu_stress.bats
 ✓ Stress: shared ResourceClaim across 15 pods x 5 loops [155559]

27 tests, 0 failures in 617 seconds


copy-pr-bot bot commented Nov 20, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@jgehrcke jgehrcke force-pushed the jp/large-cd-formation branch from 9524592 to 7815f9a on November 20, 2025 13:29
// whitespace to the left.
for _, ip := range slices.Sorted(maps.Keys(m.ipToDNSName)) {
dnsname := m.ipToDNSName[ip]
klog.Infof("%26s -> %s", dnsname, ip)
@jgehrcke (Collaborator, Author):

Change motivated by seeing logs like this:

[screenshot of log output omitted]

@jgehrcke (Collaborator, Author):

The current patch sorts the keys, and hence the IP addresses :)

@jgehrcke (Collaborator, Author):

fixed, now:
[screenshot of the sorted log output omitted]
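
For reference, a minimal self-contained sketch of the "sort by DNS name (map value)" idea (sample data and variable names are made up; the actual change in this PR may differ in detail):

```go
package main

import (
	"fmt"
	"maps"
	"slices"
)

func main() {
	// Sample data standing in for m.ipToDNSName.
	ipToDNSName := map[string]string{
		"10.50.147.229": "compute-domain-daemon-12",
		"10.50.145.152": "compute-domain-daemon-8",
		"10.50.190.155": "compute-domain-daemon-0",
	}

	// Invert the map and iterate sorted by DNS name (the map value),
	// so that repeated log dumps are stable and easy to compare.
	nameToIP := make(map[string]string, len(ipToDNSName))
	for ip, name := range ipToDNSName {
		nameToIP[name] = ip
	}
	for _, name := range slices.Sorted(maps.Keys(nameToIP)) {
		fmt.Printf("%26s -> %s\n", name, nameToIP[name])
	}
}
```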

// perform a stable sort of IP addresses before writing them to the nodes
// config file.
if !maps.Equal(newIPs, previousIPs) {
klog.Infof("IP set changed: previous: %v; new: %v", previousIPs, newIPs)
@jgehrcke (Collaborator, Author) commented Nov 20, 2025:

The bulk of the log volume emitted by the CD daemon comes from this message; we must not log all of it at level zero.

example:
[screenshot of log output omitted]
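
As a sketch of that direction (not necessarily the exact change in this PR), the full set diff can be gated behind klog verbosity level 1 while level 0 only gets a short summary:

```go
package main

import (
	"flag"
	"maps"

	"k8s.io/klog/v2"
)

func main() {
	klog.InitFlags(nil)
	flag.Parse()
	defer klog.Flush()

	// Stand-ins for the previously known and newly computed IP sets.
	previousIPs := map[string]struct{}{"10.0.0.1": {}, "10.0.0.2": {}}
	newIPs := map[string]struct{}{"10.0.0.1": {}, "10.0.0.2": {}, "10.0.0.3": {}}

	if !maps.Equal(newIPs, previousIPs) {
		// Level 0: short summary only.
		klog.Infof("IP set changed (%d -> %d addresses)", len(previousIPs), len(newIPs))
		// Level >= 1 (-v=1): the full, potentially very large, set contents.
		klog.V(1).Infof("IP set changed: previous: %v; new: %v", previousIPs, newIPs)
	}
}
```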


if err := pm.updateNodeStatus(ctx, status); err != nil {
return fmt.Errorf("failed to update node status: %w", err)
return fmt.Errorf("pod update: failed to update note status in CD (%s): %w", status, err)
@jgehrcke (Collaborator, Author):

The wrapper (workqueue) does not enrich the error message with meaningful context, so I added the pod update: prefix here -- it makes it easier to understand what a log message means. Example:

I1119 22:10:21.531887       1 workqueue.go:197] Reconcile: pod update: failed to update note status in CD (Ready): simulated error 5 (attempt 5)
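
A tiny standalone illustration of why the prefix helps (names and values are illustrative; the retry wrapper only appends the attempt counter):

```go
package main

import (
	"errors"
	"fmt"
)

func main() {
	// The underlying failure as it bubbles up from the API call.
	cause := errors.New("simulated error 5")

	// The prefix identifies the reconcile path (pod update vs. CD update)
	// that produced the error; %w keeps the cause unwrappable.
	wrapped := fmt.Errorf("pod update: failed to update note status in CD (%s): %w", "Ready", cause)

	// The generic retry wrapper only appends the attempt count.
	fmt.Printf("Reconcile: %v (attempt %d)\n", wrapped, 5)
}
```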

// UpdateComputeDomainNodeInfo updates the Nodes field in the ComputeDomain with
// info about the ComputeDomain daemon running on this node. Upon success, it
// reflects the mutation in `m.mutationCache`.
func (m *ComputeDomainManager) UpdateComputeDomainNodeInfo(ctx context.Context, cd *nvapi.ComputeDomain) (rerr error) {
@jgehrcke (Collaborator, Author):

I felt like renaming this from UpdateComputeDomainNodeInfo to EnsureNodeInfoInCD after I repeatedly found myself slightly confused about the high-level responsibility of this method.

@jgehrcke jgehrcke force-pushed the jp/large-cd-formation branch from 7815f9a to f8fceea on November 20, 2025 13:44
// fails and is retried, the delay grows exponentially starting from the
// lower value up to the upper bound.
-	workqueue.NewTypedItemExponentialFailureRateLimiter[any](250*time.Millisecond, 3000*time.Second),
+	workqueue.NewTypedItemExponentialFailureRateLimiter[any](250*time.Millisecond, 3000*time.Millisecond),

}

func DefaultCDDaemonRateLimiter() workqueue.TypedRateLimiter[any] {
return NewJitterRateLimiter(workqueue.NewTypedItemExponentialFailureRateLimiter[any](5*time.Millisecond, 6000*time.Millisecond), 0.5)
@jgehrcke (Collaborator, Author):

I thought quite a bit about these numbers, but of course they are just an attempt to pick something meaningful -- we will see over time whether and how we want to change the method and parameters.
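
For intuition, a small program (my sketch, not part of the PR) that prints the per-item backoff produced by client-go's exponential failure rate limiter with the 5 ms / 6000 ms bounds chosen above:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// 5 ms base delay, doubling per consecutive failure, capped at 6 s.
	rl := workqueue.NewTypedItemExponentialFailureRateLimiter[string](
		5*time.Millisecond, 6000*time.Millisecond)

	for attempt := 1; attempt <= 14; attempt++ {
		// When() records a failure for the item and returns the next delay.
		fmt.Printf("attempt %2d: backoff %v\n", attempt, rl.When("cd-status-update"))
	}
}
```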

@jgehrcke (Collaborator, Author):

/ok to test f8fceea

@jgehrcke jgehrcke self-assigned this Nov 20, 2025
@jgehrcke jgehrcke moved this from Backlog to In Progress in Planning Board: k8s-dra-driver-gpu Nov 20, 2025
@jgehrcke jgehrcke added this to the v25.12.0 milestone Nov 21, 2025
@jgehrcke (Collaborator, Author):

A user tested with the patch, on 160 nodes:

 $ cat compute_daemon_2025-11-25T122627Z.log | grep 'nodeName' |wc -l
160

Convergence here took 10 minutes:

$ cat compute_daemon_2025-11-25T122627Z.log | grep 'Successfully updated node status in CD' | cut -c 70- | sort |  ( sed -u 1q; tail -n1 ) 
2025-11-25T04:15:38.701748256-08:00 I1125 12:15:38.701691       1 podmanager.go:210] Successfully updated node status in CD (new nodeinfo: &{fargate-ip-10-50-82-160.us-west-2.compute.internal 10.50.145.152 5b7e82a9-c25b-2287-ae4d-b48e9748382b.2 8 Ready})
2025-11-25T04:25:37.713369472-08:00 I1125 12:25:37.713234       1 podmanager.go:210] Successfully updated node status in CD (new nodeinfo: &{fargate-ip-10-50-81-67.us-west-2.compute.internal 10.50.147.229 58ece828-23d3-f775-14e7-7e53d845a002.2 12 Ready})

That's a good enough improvement compared to "infinity", and I think this is the signal we need today to proceed with the patch.

Generally, we still go through a plethora of conflicts:

 $ cat compute_daemon_2025-11-25T122627Z.log | grep 'has been modified' | wc -l
12969

The scaling behavior (the number of conflicts seen during the convergence process, as a function of the number of nodes contributing to the CD) is still O(N^2). That's not OK. We need to get this closer to linear, and I am sure there is still a lot of room for us to make improvements.
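
(Rough back-of-the-envelope check: if each of the N = 160 nodes needs on the order of N/2 attempts before its update lands, one expects roughly N^2/2 = 12800 conflicts in total -- in the ballpark of the 12969 observed above.)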

@klueska (Collaborator) commented Nov 25, 2025:

One thing that would help is if we had separate API server objects for each IMEX domain to put their status into (i.e. one per NVLink partition / clique). That way each write would only be competing with maxNodesPerIMEXDomain pods trying to write their IP addresses / status, rather than with all daemon pods across the entire ComputeDomain.

@jgehrcke jgehrcke force-pushed the jp/large-cd-formation branch from f8fceea to fa4e30a on November 25, 2025 13:40
@jgehrcke (Collaborator, Author):

> separate API server objects for each IMEX domain to put their status into

I also briefly thought about that. I think there are many different dimensions of the solution space that we can explore. Any kind of sharding at the per-clique level may be super useful. And/or server-side apply: https://kubernetes.io/docs/reference/using-api/server-side-apply/ (I think that has a lot of potential), and probably other strategies.
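
To make the server-side apply idea a bit more concrete, here is a hypothetical sketch (the GVR matches the URL seen in the logs further down, but the status field names and layout are assumptions, not the actual ComputeDomain schema; merging per-node entries like this would also require the status list to be declared as a map-type list in the CRD):

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

// applyNodeStatus lets each CD daemon own only its own entry in the CD status
// and submit it via server-side apply: concurrent writers of non-overlapping
// fields then no longer race on resourceVersion (no 409 retry loop).
func applyNodeStatus(ctx context.Context, dc dynamic.Interface, ns, cdName, nodeName, ip string) error {
	gvr := schema.GroupVersionResource{
		Group:    "resource.nvidia.com",
		Version:  "v1beta1",
		Resource: "computedomains",
	}
	patch := map[string]any{
		"apiVersion": "resource.nvidia.com/v1beta1",
		"kind":       "ComputeDomain",
		"metadata":   map[string]any{"name": cdName},
		"status": map[string]any{
			// Hypothetical field names, for illustration only.
			"nodes": []any{
				map[string]any{"name": nodeName, "ipAddress": ip, "status": "Ready"},
			},
		},
	}
	data, err := json.Marshal(patch)
	if err != nil {
		return fmt.Errorf("marshal apply patch: %w", err)
	}
	force := true
	_, err = dc.Resource(gvr).Namespace(ns).Patch(ctx, cdName, types.ApplyPatchType, data,
		metav1.PatchOptions{FieldManager: "cd-daemon-" + nodeName, Force: &force}, "status")
	return err
}

func main() {
	// Wiring up a real dynamic client requires cluster access; omitted here.
	fmt.Println("sketch only; see applyNodeStatus")
}
```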

@jgehrcke (Collaborator, Author):

/ok to test fa4e30a

@jgehrcke (Collaborator, Author) commented Nov 25, 2025:

User feedback, round 4. 215 nodes -- tested with the last commit on this branch.

All CD daemons started within a window of less than 15 seconds:

$ cat compute_daemon_2025-11-25T152807Z.log | grep -oE '[A-Z][0-9]{4}.*' | grep 'Rendered IMEX' |  sort | ( sed -u 1q; tail -n1 ) 
I1125 15:16:16.240895       1 main.go:432] Rendered IMEX daemon config file with: {10.50.146.155 /imexd/nodes.cfg}
I1125 15:16:29.867632       1 main.go:432] Rendered IMEX daemon config file with: {10.50.171.18 /imexd/nodes.cfg}

We're down to ~four minutes convergence time:

$ cat compute_daemon_2025-11-25T152807Z.log | grep -oE '[A-Z][0-9]{4}.*' | grep 'Successfully updated node status in CD' |  sort | ( sed -u 1q; tail -n1 ) 
I1125 15:16:35.403857       1 podmanager.go:210] Successfully updated node status in CD (new nodeinfo: &{fargate-ip-10-50-83-160.us-west-2.compute.internal 10.50.190.155 1c9bcb8c-bc4e-89aa-fe79-d7fb50255d3c.4 0 Ready})
I1125 15:20:31.151841       1 podmanager.go:210] Successfully updated node status in CD (new nodeinfo: &{fargate-ip-10-50-80-70.us-west-2.compute.internal 10.50.172.15 1c9bcb8c-bc4e-89aa-fe79-d7fb50255d3c.4 17 Ready})

Specifically:

  • first CD daemon start at ~15:16:16
  • last CD daemon ready at ~15:20:31

Number of conflicts:

$ cat compute_daemon_2025-11-25T152807Z.log | grep 'has been modified' | wc -l
18136

The four-minute convergence time coincides with the four-minute informer resync period chosen in one of the last few commits here.

I inspected the logs for the 'worst-case' pod and indeed found that it was once again the informer resync that triggered recognition of the NotReady -> Ready transition:

[pod/validate-nccl-test-zvw9-w944-hjbbw-5djd5/compute-domain-daemon] 2025-11-25T07:20:30.422791104-08:00 I1125 15:20:30.422702       1 reflector.go:456] "Forcing resync" reflector="k8s.io/client-go/informers/factory.go:160"
[pod/validate-nccl-test-zvw9-w944-hjbbw-5djd5/compute-domain-daemon] 2025-11-25T07:20:30.732928480-08:00 I1125 15:20:30.732816       1 round_trippers.go:632] "Response" verb="PUT" url="https://172.20.0.1:443/apis/resource.nvidia.com/v1beta1/namespaces/ws-stable/computedomains/validate-nccl-test-zvw9-w944/status" status="409 Conflict" milliseconds=302
[pod/validate-nccl-test-zvw9-w944-hjbbw-5djd5/compute-domain-daemon] 2025-11-25T07:20:30.733012288-08:00 I1125 15:20:30.732942       1 workqueue.go:171] Reconcile: pod update: failed to update note status in CD (Ready): error updating node status in ComputeDomain: Operation cannot be fulfilled on computedomains.resource.nvidia.com "validate-nccl-test-zvw9-w944": the object has been modified; please apply your changes to the latest version and try again (attempt 1)

How is the "I am now ready" event (for every single CD daemon) distributed over time?

$ cat compute_daemon_2025-11-25T152807Z.log | grep -oE '[A-Z][0-9]{4}.*' | grep 'Successfully updated node status in CD' |  sort | awk -v ref="15:16:16" '
function to_sec(t,   a) {
    split(t, a, ":")
    return a[1]*3600 + a[2]*60 + a[3]
}
BEGIN {
    ref_sec = to_sec(ref)
}
{
    print to_sec($2) - ref_sec
}' | uplot hist --nbins 12
                  ┌                                        ┐ 
   [  0.0,  20.0) ┤▇▇▇▇ 3                                    
   [ 20.0,  40.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 20            
   [ 40.0,  60.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 27   
   [ 60.0,  80.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 16                  
   [ 80.0, 100.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 25      
   [100.0, 120.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 27   
   [120.0, 140.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 23        
   [140.0, 160.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 24       
   [160.0, 180.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 17                
   [180.0, 200.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 20            
   [200.0, 220.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 12                       
   [220.0, 240.0) ┤▇▇▇▇▇▇▇▇ 6                                
   [240.0, 260.0) ┤▇▇▇▇▇▇▇ 5                                 
                  └                                        ┘ 

The awk command line translates each log line timestamp into a number (the difference in seconds to the start of convergence at 15:16:16), and then we simply plot a histogram showing the distribution of those numbers (frequency to the right, time in seconds increasing from top to bottom).

That's a fairly good-looking distribution. I think we should move ahead with this patch now. Separately, we should further investigate whether the normal informer-based event propagation is suitable for (not even particularly fast) detection of pod readiness changes. We're probably still doing something wrong in our informer/pipeline/event-handling setup.
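
For that follow-up, this is roughly the knob in question: the shared informer's resync period and the pod update handler. A generic client-go sketch (not the driver's actual setup), assuming a 4-minute resync as mentioned above:

```go
package main

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/klog/v2"
)

// newPodInformer wires up a pod informer with a 4-minute resync period: even
// if a watch event for a readiness change is missed (or mishandled), the
// periodic resync re-delivers the cached objects to UpdateFunc -- the safety
// net visible in the "Forcing resync" log line above.
func newPodInformer(clientset kubernetes.Interface, namespace string) cache.SharedIndexInformer {
	factory := informers.NewSharedInformerFactoryWithOptions(
		clientset, 4*time.Minute, informers.WithNamespace(namespace))

	informer := factory.Core().V1().Pods().Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldPod, newPod := oldObj.(*corev1.Pod), newObj.(*corev1.Pod)
			if oldPod.ResourceVersion == newPod.ResourceVersion {
				// Identical object: this update was delivered by a resync,
				// not by an actual change on the API server.
				klog.V(1).Infof("resync-triggered update for pod %s", newPod.Name)
			}
			// ... enqueue reconcile work here ...
		},
	})
	return informer
}

func main() {}
```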

@klueska (Collaborator) left a comment:

A few nits, but looks good in general.

Any incoming pod update should terminate the retry
loop initiated for a previously incoming pod
update. The same for any incoming CD update.

Any pod update refers to the same pod object, and
any CD update refers to the same CD object. Make
that explicit by using hard-coded keys.

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
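
A sketch of the coalescing effect described in the commit message above (the key name is hypothetical): with a single hard-coded key per update kind, the typed workqueue deduplicates, so a freshly observed update effectively supersedes a pending retry for an older one.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	rl := workqueue.NewTypedItemExponentialFailureRateLimiter[string](
		5*time.Millisecond, 6000*time.Millisecond)
	queue := workqueue.NewTypedRateLimitingQueue[string](rl)

	const podUpdateKey = "pod-update" // hypothetical hard-coded key

	// A failed reconcile re-queues the same key with backoff ...
	queue.AddRateLimited(podUpdateKey)
	// ... and a newly observed pod update adds that same key again. The queue
	// deduplicates identical keys, so there is only ever one pending
	// "pod update" work item; the newest state wins when it gets processed.
	queue.Add(podUpdateKey)

	fmt.Println("pending items:", queue.Len()) // 1
}
```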
For large CDs this makes it faster to identify
changes from the log output.

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
This is less error-prone: if we treat `node ==
nil` generally as success, we may miss persisting
a pod state transition in edge cases and for
edge-case state transitions after the initial
NotReady -> Ready transition.

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
During reconciliation of pod and CD updates a
number of log messages and error messages are
flowing through the system, and this change makes
it easier to understand which messages belong
together and what is actually happening.

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
This reduces the amount of log volume on the
default log level for large CDs.

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
This was meant to be three seconds, not 3000 seconds.

This is a node-local retry and we can easily
afford not backing off towards O(1 min) or
further.

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
Change upper bound from 1000 s (17 minutes) to something much less.

For formation of a larger ComputeDomain (N nodes),
many writers want to update the same API server
object.

The overall work that has to be done by the API
server scales linearly with N: a certain number of
updates (at least N, in the best case) is
required.

Hence, in the best case (perfect serialization, no
conflicts), the overall time it takes to get all
updates in (the ideal convergence time C) scales
linearly with N.

In the worst case, when individual updates always
conflict with each other (are performed 'at the
same time' against the same reference state),
convergence is never achieved.

Without centralized coordination, backing off
individual retriers is a way to spread out updates
over time. The nature of the distribution of those
back-offs governs how the actual convergence time
compares to the ideal case C.

The ideal case C is governed by the rate R at
which the central entity can process updates.

If we naively back off exponentially without a
sane upper bound, then we don't spread the update
load homogeneously over time, but instead inject
fewer and fewer updates into the system as time
progresses. The attempted update rate then falls
far below R (the possible update rate). That makes
convergence unnecessarily slow.

If we do not back off enough, an opposite effect
may occur because the global rate of retries
accumulating at the central point (API server) may
always exceed R, and hence thrash resources and
slow things down compared to the theoretical
update rate maximum (in case of perfectly
serialized updates).

Hence, there is a sweet spot between both extrema.
The positioning of that sweet spot strongly
depends on R.

Summary:

1) We do not want to back off individual retriers
   too far, otherwise we operate at an update rate
   lower than necessary and artificially slow down
   the convergence process.

2) We need to back off individual retriers enough
   to prevent thrashing from slowing us and others
   down. This is critical for making sure the
   convergence time scales linearly with N
   (instead of, say, O(N**2)).

This patch primarily takes care of (1).

For (2), in the future, we may want to further
increase that upper bound after a certain amount
of time (if e.g. a 5 second cap does not result in
overall convergence after e.g. 30 minutes, it may
be worth backing off further, to remove stress
from the API server).

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
The client-go rate limiters such as
`ExponentialFailureRateLimiter` do not implement
jitter. In a user's environment, formation of a CD
across 144 nodes has shown that the absence of
jitter results in significant retry attempt
correlation across nodes -- even after ~10
retries, resulting in otherwise preventable
conflicts (and hence increased convergence time).

That effect can be diminished by adding jitter,
which should allow for fewer conflicts and hence
faster convergence.

The JitterRL implementation provided by this patch
is a simple, custom implementation that I
validated with simulated errors and careful
placement of log messages.

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
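
A minimal sketch of what such a jitter wrapper can look like (illustrative only; the actual implementation in this patch may draw and apply the jitter differently):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// jitterRateLimiter wraps another rate limiter and scales each delay by a
// random factor in [1-jitter, 1+jitter] to decorrelate retries across nodes.
type jitterRateLimiter[T comparable] struct {
	inner  workqueue.TypedRateLimiter[T]
	jitter float64
}

func (r jitterRateLimiter[T]) When(item T) time.Duration {
	d := r.inner.When(item)
	factor := 1 + r.jitter*(2*rand.Float64()-1)
	return time.Duration(float64(d) * factor)
}

func (r jitterRateLimiter[T]) Forget(item T)          { r.inner.Forget(item) }
func (r jitterRateLimiter[T]) NumRequeues(item T) int { return r.inner.NumRequeues(item) }

func main() {
	rl := jitterRateLimiter[string]{
		inner:  workqueue.NewTypedItemExponentialFailureRateLimiter[string](5*time.Millisecond, 6*time.Second),
		jitter: 0.5,
	}
	for i := 0; i < 8; i++ {
		fmt.Printf("attempt %d: %v\n", i+1, rl.When("cd-update"))
	}
}
```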
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
See issue NVIDIA#742.

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
Generally, output looks cleaner with that.
Makes sorting lexicographically more meaningful
(otherwise 10 comes right after 1).

Should be OK even for upgraded systems as IMEX daemon
nodes config and /etc/hosts are written in tandem.

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
Sort by DNS name (map value).

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
@jgehrcke jgehrcke force-pushed the jp/large-cd-formation branch from fa4e30a to 4df71bc on November 26, 2025 11:34
@klueska klueska added the robustness issue/pr: edge cases & fault tolerance label Nov 26, 2025
@jgehrcke (Collaborator, Author) commented Nov 26, 2025:

Test suite passed for the most recent commit:

$ git rev-parse HEAD
4df71bcf124cc05909ffcf47213b92405c9294f2

$ TEST_CHART_LOCAL=1 make bats
make -f tests/bats/Makefile tests
...
27 tests, 0 failures in 603 seconds

@jgehrcke jgehrcke merged commit 3a3287c into NVIDIA:main Nov 26, 2025
7 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Closed in Planning Board: k8s-dra-driver-gpu Nov 26, 2025
@jgehrcke (Collaborator, Author):

/cherry-pick release-25.8

@github-actions

🤖 Backport PR created for release-25.8: #745


Labels

backport-25.8 cherry-pick/release-25.8 robustness issue/pr: edge cases & fault tolerance


Development

Successfully merging this pull request may close these issues.

DefaultPrepUnprepRateLimiter backs off too much

2 participants